Introduction to Preparing Private Data for RAG

RAG Fundamentals

Standard large language models (LLMs) are "frozen" in time, limited by the cutoff of their training data. They cannot answer questions about your company's internal handbook or about a private video meeting held yesterday. Retrieval-Augmented Generation (RAG) closes this gap by giving the model relevant context retrieved from your own private data.

The Multi-Step Pipeline

To turn private data into something an LLM can use, we follow a specific pipeline:

Load: Convert varied formats (e.g. PDF, websites, YouTube) into a standard document format.
Split: Break long documents into small, manageable chunks.
Embed: Convert each text chunk into a numeric vector (a mathematical representation of its meaning).
Store: Save these vectors in a vector store (e.g. Chroma) for fast similarity search.
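The four steps above can be sketched end to end in plain Python. This is only an illustration under loud assumptions: the word-count "embedding" stands in for a real embedding model, the in-memory list stands in for a vector store such as Chroma, and every function name and sample text here is hypothetical.

```python
import math
import re
from collections import Counter


def load(raw_sources):
    """Load: normalize each (source_name, text) pair into a common document dict."""
    return [{"text": text.strip(), "metadata": {"source": name}} for name, text in raw_sources]


def split(documents, chunk_size=60):
    """Split: cut each document into pieces of at most chunk_size characters."""
    chunks = []
    for doc in documents:
        text = doc["text"]
        for start in range(0, len(text), chunk_size):
            chunks.append({"text": text[start:start + chunk_size], "metadata": doc["metadata"]})
    return chunks


def embed(text):
    """Embed: toy word-count vector (an assumption, not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def store(chunks):
    """Store: keep (vector, chunk) pairs in a list (a stand-in for a vector store)."""
    return [(embed(c["text"]), c) for c in chunks]


def search(index, query, k=1):
    """Return the k stored chunks whose vectors are most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(query_vec, pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]


# Hypothetical private data, invented for this example.
raw_sources = [
    ("handbook.txt", "Employees may work remotely two days per week."),
    ("meeting.txt", "The launch date moved to the first week of March."),
]
index = store(split(load(raw_sources)))
best = search(index, "When is the launch date?")[0]
print(best["metadata"]["source"], "->", best["text"])
```

In a real pipeline the retrieved chunk would be placed into the prompt sent to the LLM. Word counts only match exact words, so a real embedding model is what lets the search match on meaning.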

Why Chunking Matters

LLMs have a "context window" (a limit on how much text they can process at once). If you send a 100-page PDF, it may exceed that limit, and the request can fail or be truncated. We split data into small chunks so that only the most relevant information is sent to the model.
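A minimal sketch of character-based chunking may make this concrete. The helper `chunk_text` is hypothetical, written for this lesson rather than taken from any library. Its `chunk_overlap` parameter mirrors the one named in the quiz below, and `chunk_size` is an assumed companion parameter.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters; neighbors share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk whenever it is no longer than the overlap.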

Question 1

Why is chunk_overlap considered a critical parameter when splitting documents for RAG?

To reduce the total number of tokens used by the LLM.

To ensure that semantic context (the meaning of a thought) is not cut off at the end of a chunk.

To make the vector database store data faster.

Challenge: Preserving Context

Apply your knowledge to a real-world scenario.

You are loading a YouTube transcript for a technical lecture. You notice that the search results are confusing "Lecture 1" content with "Lecture 2."

Task

Which splitter would be best for keeping context like "Section Headers" intact?

Solution:
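MarkdownHeaderTextSplitter is the better fit. It splits on Markdown header lines and records the headers each chunk falls under in that chunk's metadata, which helps the retrieval system distinguish between different chapters or lectures. This assumes the transcript contains Markdown-style headers. RecursiveCharacterTextSplitter splits on a prioritized list of separators (paragraphs, then lines, then words), which helps keep related text together, but on its own it does not record headers in the metadata.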
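To show the idea of keeping headers in metadata, here is a hand-rolled sketch in plain Python. It is not the implementation of MarkdownHeaderTextSplitter. The function `split_by_headers`, the metadata layout (header level mapped to title), and the sample transcript are all assumptions made for illustration.

```python
def split_by_headers(markdown_text):
    """Split on '#' header lines and record the active headers in each chunk's metadata."""
    chunks = []
    active = {}  # header level -> header title currently in effect
    body = []    # lines collected since the last header

    def flush():
        content = "\n".join(body).strip()
        if content:
            chunks.append({"text": content, "metadata": dict(active)})
        body.clear()

    for line in markdown_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):  # toy rule: any line starting with '#' is a header
            flush()
            level = len(stripped) - len(stripped.lstrip("#"))
            # A new header replaces any headers at the same or deeper level.
            for deeper in [lvl for lvl in active if lvl >= level]:
                del active[deeper]
            active[level] = stripped[level:].strip()
        else:
            body.append(line)
    flush()
    return chunks


# Hypothetical transcript with Markdown-style headers.
transcript = """# Lecture 1
## Setup
Install the tools first.
# Lecture 2
## Setup
Configure the vector store.
"""
header_chunks = split_by_headers(transcript)
for chunk in header_chunks:
    print(chunk["metadata"], "->", chunk["text"])
```

Both chunks sit under a "Setup" section, but their metadata carries different lecture titles, so a retrieval system could filter or label results by lecture instead of mixing them.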